An Efficient FFT-Mapping Method Based on Cache Optimization

Author

  • ZHU Liang
Abstract

The Fast Fourier Transform (FFT) is a key technology in real-time signal processing systems, so the efficiency with which the FFT algorithm is mapped onto a hardware system is of great significance. We first examine the FFT execution process on a processor, then analyze its memory access behavior under the cache mechanism, and show that the cache hit rate directly affects FFT execution time. Based on this, an efficient mapping method is proposed that splits a long FFT into multiple segments, each small enough to fit within the cache capacity. The cache hit rate is thereby improved and, correspondingly, so is the execution efficiency. Finally, the new method is evaluated on ADI's TS201 digital signal processor, and the results show that FFT execution time is reduced substantially.

Key words: Fast Fourier Transform (FFT); Cache; Hit Rate

Fund program: NSFC (61370017)

Introduction

The FFT (Fast Fourier Transform) is one of the key technologies in modern radar signal processing. Because the FFT is an important part of the radar system, which is a hard real-time processing system, the computation must be completed within a limited amount of time, so efficient FFT processing is indispensable. At present there are two main ways to implement an efficient FFT [1]. The first is a dedicated ASIC or FPGA design; however, this approach lacks flexibility and its development cycle is not short enough. The second, now more popular, method is a programmed implementation on a processor. Implementing the FFT in program code offers good flexibility and a short development cycle, and as processor clock speeds rise, FFT execution times keep shrinking, so the processor-based approach has clear advantages. The remaining problem is that the processor's architecture is fixed and was not designed for the FFT algorithm, so the two do not match well.

To achieve the targets above, we first analyze the structure of the superscalar processor, parameterize the factors associated with processing, and establish a superscalar processor execution model. Based on this model, we take a radix-2 FFT algorithm as an example, analyze the disadvantages of the conventional FFT mapping method, and propose a solution to these problems: the splittable FFT-mapping method. This mapping method makes full use of the processor's execution resources and, combined with the features of the cache, greatly reduces memory access time, so the processor's performance can be exploited to the fullest. Finally, using this mapping method, the radix-2 FFT algorithm is implemented on ADI's TS201 digital signal processor and achieves satisfactory results.
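For reference, the radix-2 FFT taken as the running example can be written as a short iterative routine. The C sketch below is a generic textbook radix-2 decimation-in-time FFT, not the authors' TS201 implementation; the function name, the use of C99 double-precision complex arithmetic, and the in-place layout are illustrative assumptions.

```c
#include <complex.h>
#include <math.h>
#include <stddef.h>
#include <stdio.h>

/* Minimal in-place iterative radix-2 decimation-in-time FFT.
 * x: n complex samples, n must be a power of two.
 * Reference sketch only, not an optimized DSP implementation. */
static void fft_radix2(double complex *x, size_t n)
{
    const double pi = acos(-1.0);

    /* Bit-reversal permutation of the input order. */
    for (size_t i = 1, j = 0; i < n; i++) {
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j |= bit;
        if (i < j) {
            double complex tmp = x[i];
            x[i] = x[j];
            x[j] = tmp;
        }
    }

    /* Butterfly stages for sub-transform lengths 2, 4, ..., n. */
    for (size_t len = 2; len <= n; len <<= 1) {
        double complex wlen = cexp(-2.0 * pi * I / (double)len);
        for (size_t i = 0; i < n; i += len) {
            double complex w = 1.0;
            for (size_t k = 0; k < len / 2; k++) {
                double complex u = x[i + k];
                double complex v = x[i + k + len / 2] * w;
                x[i + k]           = u + v;
                x[i + k + len / 2] = u - v;
                w *= wlen;
            }
        }
    }
}

int main(void)
{
    /* 8-point example: a shifted impulse. */
    double complex x[8] = {0};
    x[1] = 1.0;
    fft_radix2(x, 8);
    for (int k = 0; k < 8; k++)
        printf("X[%d] = %6.3f %+6.3fj\n", k, creal(x[k]), cimag(x[k]));
    return 0;
}
```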
1. The structural features of superscalar processor

Modern processors use superscalar pipelining. A pipeline is a form of parallelism in which the execution of an instruction is divided into several stages, such as instruction fetching, decoding, addressing, executing, and writing back; the pipeline then allows multiple instructions to be in different stages of processing at the same time. This parallelism in time improves the throughput of the processor. In addition, the processor can use different execution units for instructions of different types, so that the processing of several instructions can be started in each machine cycle, further improving execution efficiency [1,2]. Figure 1 shows the operating principle of a superscalar processor.

Figure 1. Schematic diagram of a superscalar pipeline.

Figure 1 shows a superscalar pipeline with four instructions in flight. The pipeline is divided into six stages:
1. Reading. This stage obtains the execution codes from memory.
2. Decoding. This stage translates the execution codes into machine-readable operation codes.
3. Scheduling. This stage assigns the codes to the functional units according to their functions.
4. Executing. This stage carries out the instructions' functions.
5, 6. Completing and exiting. These two stages update the machine state after the instructions complete and ensure that instructions are retired in order.

In a superscalar architecture, stages 1, 2, 3, 5, and 6 follow the same mechanism for all instructions, so these five stages are usually handled in a centralized way. Stage 4, in contrast, differs from instruction to instruction, so it is designed as distributed processing [1,2,3].

As described above, a superscalar processor dispatches instructions to their functional units, which greatly improves processing capacity. This description, however, assumes that every stage takes the same amount of time. In practice, data accesses take longer, because memory access speed is much lower than the core speed; this stalls the pipeline and reduces processing efficiency [3]. To solve this problem, a cache whose speed is comparable to the core's was introduced into the processor. The cache exploits the principle of locality by storing recently accessed data. When memory is accessed, the cache is queried first: if the cache hits, the data is returned directly and the response is fast; if the cache misses, a memory access is started and the result is placed in the cache [4,5,6]. With a reasonable cache capacity, the speed-mismatch problem can be alleviated and the processor can realize its full performance.
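The benefit of a cache hit can be quantified with the standard average-access-time model: if a fraction h of accesses hits in the cache, the mean cost per access is h * t_cache + (1 - h) * t_mem. The short C sketch below illustrates this relationship; the latency figures are illustrative assumptions, not measured TS201 values.

```c
#include <stdio.h>

/* Effective memory access time under a simple two-level model:
 * a fraction hit_rate of accesses is served from the cache,
 * the rest from main memory. Latencies are in processor cycles. */
static double effective_access_time(double hit_rate,
                                    double t_cache,
                                    double t_memory)
{
    return hit_rate * t_cache + (1.0 - hit_rate) * t_memory;
}

int main(void)
{
    /* Illustrative latencies only (not measured TS201 values). */
    const double t_cache  = 1.0;   /* cycles per cached access   */
    const double t_memory = 30.0;  /* cycles per off-chip access */

    printf("hit rate 0.60: %.1f cycles/access\n",
           effective_access_time(0.60, t_cache, t_memory));
    printf("hit rate 0.95: %.1f cycles/access\n",
           effective_access_time(0.95, t_cache, t_memory));
    return 0;
}
```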
In conclusion, the modern superscalar processor architecture adopts a parallel pipeline to increase throughput and processing efficiency, and introduces a cache to solve the speed mismatch between the memory and the core; together these measures improve the processor's capacity. In practice, the real-time performance of a program depends on two factors: the processor's processing capacity and the way the algorithm is mapped onto the processor. A suitable mapping is therefore important, so that the execution process fits the features of the superscalar architecture, maximizes the processing capacity, and achieves real-time processing. In the following, we model the execution process of a modern superscalar processor, analyze the FFT computation on the basis of this model, and finally propose a new, effective FFT mapping method.

2. Effective FFT mapping method based on superscalar processor

As mentioned previously, different kinds of instructions run in different execution units, so the processor can be abstracted as a set of several interrelated execution units. Let $s$ be the number of execution units and $v_i$ the execution speed of an arbitrary unit $i$; the speeds of all the units then form the set $V = \{v_1, v_2, \ldots, v_s\}$. (Execution speed is defined differently for different units: for memory units it is the access bandwidth, and for core units it is the number of operations per unit time.) Accordingly, to study the real-time capability we examine the relationship between the elements of the set of task execution times $T$, which is the subject of the following discussion. Before the discussion, we define the following parameters:

[1] Denote by $\eta_{i|j}$ the degree of overlap of execution time, that is, the percentage of the overlap between task $i$ and task $j$ measured relative to task $j$. (The overlap reflects the parallelism between the execution units; its value depends on the mapping algorithm and on the processor resources.)

[2] Denote by $t_{(i,j)}$ the total execution time of task $i$ and task $j$.

According to the definitions above, the total execution time is

$t_{(i,j)} = t_i + (1 - \eta_{i|j})\, t_j .$   (1)

If we regard task $i$ and task $j$ as one combined task $I$, the total execution time of task $I$ and another task $k$ in the set is

$t_{(I,k)} = t_{(i,j)} + (1 - \eta_{(i,j)|k})\, t_k .$   (2)

Applying (2) recursively, the total execution time of all $n$ tasks is

$t_{(1,2,\ldots,n)} = t_1 + \sum_{k=2}^{n} \bigl(1 - \eta_{(1,\ldots,k-1)|k}\bigr)\, t_k .$   (3)
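Equations (1)-(3) can be transcribed directly into code: each task added to the merged set contributes only the fraction of its time that does not overlap with the tasks already accounted for. The following C sketch is such a transcription; the function name, array layout, and example numbers are illustrative assumptions rather than anything given in the paper.

```c
#include <stdio.h>

/* Total execution time of n tasks under the overlap model of Eqs. (1)-(3).
 * t[k]   : stand-alone execution time of task k
 * eta[k] : overlap of the previously merged tasks with task k,
 *          expressed as a fraction of t[k] (eta[0] is unused). */
static double total_execution_time(const double *t, const double *eta, int n)
{
    double total = t[0];                      /* t_1 */
    for (int k = 1; k < n; k++)
        total += (1.0 - eta[k]) * t[k];       /* add non-overlapped part */
    return total;
}

int main(void)
{
    /* Illustrative numbers: two compute tasks and one memory-access task. */
    double t[]   = {100.0, 80.0, 60.0};
    double eta[] = {0.0,   0.75, 0.50};
    printf("total time = %.1f\n", total_execution_time(t, eta, 3));
    return 0;
}
```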




Publication year: 2014